feat(pt/dpmodel): add lmdb dataloader by iProzd · Pull Request #5283 · deepmodeling/deepmd-kit

iProzd · 2026-03-03T07:56:19Z

Summary by CodeRabbit

New Features
- Full LMDB data support across training, validation and testing with PyTorch-friendly dataset/dataloader integration, per-atom-count (nloc) grouping, and deterministic distributed batching.
Tests
- Extensive unit and integration tests covering LMDB reader, dataset/dataloader, samplers, collation, and mixed-/uniform-nloc scenarios.
Documentation
- Added an example LMDB training configuration and updated test runner to handle LMDB-grouped test runs.
Chores
- Added runtime dependencies for LMDB and msgpack support.

+    def test_getitem_out_of_range(self, lmdb_dir):
+        ds = LmdbDataset(lmdb_dir, type_map=["O", "H"], batch_size=2)
+        with pytest.raises(IndexError):
+            ds[999]


coderabbitai · 2026-03-03T08:03:45Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Adds comprehensive LMDB support: a framework-agnostic LMDB reader, merge/test utilities, PyTorch LMDB Dataset/collation/samplers, trainer/test and test entrypoint LMDB branches, an example LMDB config, extensive unit tests, and new lmdb/msgpack dependencies.

Changes

Cohort / File(s)	Summary
LMDB core utils `deepmd/dpmodel/utils/lmdb_data.py`, `deepmd/dpmodel/utils/__init__.py`	New LMDB data layer implementing `LmdbDataReader`, `LmdbTestData`, `merge_lmdb`, per-frame decoding/remapping, frame_nloc support, and batch samplers (`SameNlocBatchSampler`, `DistributedSameNlocBatchSampler`); exports updated.
PyTorch integration `deepmd/pt/utils/lmdb_dataset.py`, `deepmd/pt/entrypoints/main.py`, `deepmd/pt/train/training.py`	Adds `LmdbDataset`, `_collate_lmdb_batch`, `_SameNlocBatchSamplerTorch`; integrates LMDB paths into trainer/entrypoint flows and dataloader creation; updates type hints and conditional logic to accept LMDB datasets.
Test entrypoint `deepmd/entrypoints/test.py`	Detects LMDB sources via `is_lmdb`, uses `LmdbTestData`, creates per-nloc views when present and runs tests per-group; preserves non-LMDB behavior.
Examples `examples/lmdb_data/input_lmdb.json`	Adds an example LMDB training configuration demonstrating LMDB train/validation paths and batch settings.
Tests `source/tests/consistent/test_lmdb_data.py`, `source/tests/pt/test_lmdb_dataloader.py`	Large test suites validating reader↔dataset consistency, uniform/mixed-nloc scenarios, sampler determinism, collate behavior, distributed batching, and test-data utilities.
Project deps `pyproject.toml`	Adds `lmdb` and `msgpack` to project dependencies.
Misc exports `deepmd/dpmodel/utils/__init__.py`, `deepmd/pt/utils/lmdb_dataset.py`	Public exports updated to expose LMDB-related symbols (`LmdbDataReader`, `LmdbDataset`, `LmdbTestData`, samplers, `is_lmdb`, collate helper).

Sequence Diagrams

sequenceDiagram
    participant Trainer as Trainer
    participant Main as prepare_trainer_input_single
    participant Checker as is_lmdb
    participant DS as LmdbDataset / DpLoaderSet
    participant Reader as LmdbDataReader
    participant LMDB as LMDB DB
    participant DL as PyTorch DataLoader

    Trainer->>Main: request(train_systems, val_systems)
    Main->>Checker: is_lmdb(train_systems)
    alt LMDB path
        Main->>DS: create LmdbDataset(train_path)
        Main->>DS: create LmdbDataset(val_path) or DpLoaderSet
    else Non-LMDB
        Main->>DS: create DpLoaderSet(...)
    end
    Main-->>Trainer: (train_data, val_data, DPPath)

    Trainer->>DL: get_data_loader(train_data)
    DL->>DS: initialize (sampler, collate_fn)
    DS->>Reader: Reader.__init__(lmdb_path)
    Reader->>LMDB: open read-only txn and read __metadata__

    loop per-batch
        DL->>Reader: Reader.__getitem__(frame_id)
        Reader->>LMDB: fetch frame bytes
        Reader->>Reader: decode arrays & remap keys
        Reader-->>DL: frame dict (np arrays)
        DL->>DL: _collate_lmdb_batch(batch)
        DL-->>Trainer: batched tensors
    end

sequenceDiagram
    participant App as Test CLI
    participant LD as LmdbTestData
    participant View as _LmdbTestDataNlocView
    participant TestRunner as test functions

    App->>LD: LmdbTestData(lmdb_path)
    LD->>LD: compute nloc_groups
    alt multiple nloc groups
        LD->>View: create per-nloc views
        loop per nloc
            App->>TestRunner: run tests(view, label)
        end
    else single group
        App->>TestRunner: run tests(LD, label)
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

njzjz
wanghan-iapcm

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 48.26% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'feat(pt/dpmodel): add lmdb dataloader' directly and specifically describes the main change: adding LMDB dataloader functionality to the PyTorch (pt) and DPModel (dpmodel) modules.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

deepmd/pt/train/training.py (1)

529-542: ⚠️ Potential issue | 🔴 Critical

Multi-task LMDB path still assumes .sampler.weights and can crash.

In LMDB mode, dataloaders are batch-sampler based and do not provide sampler weights. The current multi-task code still dereferences .sampler.weights in both summary printing and num_epoch_dict total-batch computation.

Proposed fix

                 training_data[model_key].print_summary(
                     f"training in {model_key}",
-                    to_numpy_array(self.training_dataloader[model_key].sampler.weights),
+                    to_numpy_array(self.training_dataloader[model_key].sampler.weights)
+                    if not isinstance(training_data[model_key], LmdbDataset)
+                    else [1.0],
                 )
@@
                     validation_data[model_key].print_summary(
                         f"validation in {model_key}",
-                        to_numpy_array(
-                            self.validation_dataloader[model_key].sampler.weights
-                        ),
+                        to_numpy_array(
+                            self.validation_dataloader[model_key].sampler.weights
+                        )
+                        if not isinstance(validation_data[model_key], LmdbDataset)
+                        else [1.0],
                     )
@@
                 for model_key in self.model_keys:
-                    sampler_weights = to_numpy_array(
-                        self.training_dataloader[model_key].sampler.weights
-                    )
-                    per_task_total.append(
-                        compute_total_numb_batch(
-                            training_data[model_key].index,
-                            sampler_weights,
-                        )
-                    )
+                    if isinstance(training_data[model_key], LmdbDataset):
+                        per_task_total.append(training_data[model_key].total_batch)
+                    else:
+                        sampler_weights = to_numpy_array(
+                            self.training_dataloader[model_key].sampler.weights
+                        )
+                        per_task_total.append(
+                            compute_total_numb_batch(
+                                training_data[model_key].index,
+                                sampler_weights,
+                            )
+                        )

Also applies to: 583-590

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@deepmd/pt/train/training.py` around lines 529 - 542, The code assumes
sampler.weights exists for training/validation dataloaders (see
training_data[model_key].print_summary calls and the num_epoch_dict total-batch
computation) which breaks for LMDB batch-sampler dataloaders; fix by guarding
access to sampler.weights: when printing call
to_numpy_array(self.training_dataloader[model_key].sampler.weights) only if
hasattr(self.training_dataloader[model_key].sampler, "weights") (or use getattr
with a default None) and pass None or an empty array otherwise, and for
num_epoch_dict total-batch computation replace uses of sum(sampler.weights) with
a safe fallback that uses len(self.training_dataloader[model_key]) (number of
batches) when weights are absent; update the same logic for
validation_dataloader and any other places that reference .sampler.weights
(e.g., the block around num_epoch_dict).

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@deepmd/dpmodel/utils/lmdb_data.py`:
- Around line 242-253: The batch_size handling silently coerces unknown strings
and accepts non-positive integers; update the branch in the batch_size setter so
that only "auto" or "auto:<positive int>" strings are allowed (parse "auto:<n>"
into an int and validate n>0), otherwise raise a ValueError with a clear
message; likewise, when batch_size is numeric (the else branch that sets
self.batch_size = int(batch_size)) validate the parsed integer is >0 and raise a
ValueError if not; keep using _compute_batch_size(self._natoms, self._auto_rule)
for the "auto" case but ensure _auto_rule is a positive int after parsing.
- Line 374: The local variable unique_nlocs assigned from
sorted(self._nloc_groups.keys()) is unused and triggers a Ruff F841 lint error;
remove this assignment statement (the unused local unique_nlocs) from the method
where it appears so the code no longer creates unique_nlocs but retains any
needed logic that uses self._nloc_groups directly.
- Around line 880-882: The merge logic assumes src_nlocs has at least nframes
entries and does frame_nlocs.append(int(src_nlocs[i])) without bounds checking;
change this to guard the index: if src_nlocs is not None and i < len(src_nlocs)
then append int(src_nlocs[i]) else compute/derive the per-frame natoms (e.g., by
parsing the frame from the source data or using the fallback natoms value
already available) so that malformed/truncated metadata does not raise
IndexError during the merge; update the block around the frame_nlocs population
(references: frame_nlocs, src_nlocs, nframes) to perform this fallback behavior.
- Around line 705-719: In get_test(), handle the empty-dataset case before using
max on self._nloc_groups: if self._nloc_groups is empty (or len(self._frames) ==
0) raise a clear ValueError (e.g. "No frames in LMDB / no nloc groups
available") instead of letting max(...) raise a cryptic error; otherwise proceed
with the existing logic that sets natoms and frame_indices from the largest
group in self._nloc_groups.
- Around line 263-267: Before building natoms_vec/vec, validate that all atom
type ids in atype are within [0, self._ntypes-1]; detect any ids <0 or >=
self._ntypes and raise a descriptive ValueError (including the offending ids or
their min/max and, if available, the frame index) instead of silently truncating
via np.bincount(... )[: self._ntypes]. Place this check immediately before the
np.bincount call that computes counts (the code using atype and self._ntypes and
filling vec/natoms_vec) so you fail fast and keep natoms_vec consistent with the
frame contents.

In `@deepmd/entrypoints/test.py`:
- Around line 233-266: The append_detail flag is incorrectly derived from cc so
LMDB sub-groups for the first system overwrite detail files; instead track
whether a given sys_label has already been written and pass append_detail=True
for subsequent sub-groups. Modify the loop over data_items: introduce a small
seen map (e.g., seen_systems dict keyed by sys_label) and compute append_detail
= seen_systems.get(sys_label, False) when calling
test_ener/test_dos/test_property (replace the current append_detail=(cc != 0));
after the call set seen_systems[sys_label] = True so later sub-groups append
rather than overwrite.

In `@source/tests/consistent/test_lmdb_data.py`:
- Line 128: The loop variable j in the loop "for j in range(count):" is unused;
rename it to "_" (or "_i"/"_count" per project convention) to satisfy Ruff B007
and avoid lint failures, update any related references in that loop (none
expected), and re-run ruff check . and ruff format . to ensure formatting and
linting pass.
- Around line 405-407: The current test contains a tautological assertion
self.assertEqual(atype.shape[1], atype.shape[1]) which does not validate batch
consistency; replace it with an assertion that compares the nloc dimension
across frames in the batch (use atype as the per-batch array) — e.g. compute a
reference_nloc from the first frame (reference_nloc = atype[0].shape[1]) and
assert every frame's shape[1] equals reference_nloc (or use numpy/all to check
atype[:,1?] consistency) so the test in test_lmdb_data.py actually verifies "All
frames in batch have same nloc".

In `@source/tests/pt/test_lmdb_dataloader.py`:
- Line 480: The test contains Ruff violations: the loop uses an unused loop
variable i and the variable s_default is assigned but never used; fix by
changing the loop to use a throwaway name (e.g., replace "for i in range(10):"
with "for _ in range(10):") or otherwise use i, and remove or use the s_default
assignment (remove the dead "s_default = ..." line if it's not needed or
reference it where intended) so there are no unused variables left.

---

Outside diff comments:
In `@deepmd/pt/train/training.py`:
- Around line 529-542: The code assumes sampler.weights exists for
training/validation dataloaders (see training_data[model_key].print_summary
calls and the num_epoch_dict total-batch computation) which breaks for LMDB
batch-sampler dataloaders; fix by guarding access to sampler.weights: when
printing call
to_numpy_array(self.training_dataloader[model_key].sampler.weights) only if
hasattr(self.training_dataloader[model_key].sampler, "weights") (or use getattr
with a default None) and pass None or an empty array otherwise, and for
num_epoch_dict total-batch computation replace uses of sum(sampler.weights) with
a safe fallback that uses len(self.training_dataloader[model_key]) (number of
batches) when weights are absent; update the same logic for
validation_dataloader and any other places that reference .sampler.weights
(e.g., the block around num_epoch_dict).

ℹ️ Review info

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a3db25a and 5a10d4d.

📒 Files selected for processing (9)

deepmd/dpmodel/utils/__init__.py
deepmd/dpmodel/utils/lmdb_data.py
deepmd/entrypoints/test.py
deepmd/pt/entrypoints/main.py
deepmd/pt/train/training.py
deepmd/pt/utils/lmdb_dataset.py
examples/lmdb_data/input_lmdb.json
source/tests/consistent/test_lmdb_data.py
source/tests/pt/test_lmdb_dataloader.py

codecov · 2026-03-03T08:32:13Z

Codecov Report

❌ Patch coverage is 82.91457% with 170 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.39%. Comparing base (efc27cf) to head (72736f7).
⚠️ Report is 3 commits behind head on master.

Files with missing lines	Patch %	Lines
deepmd/dpmodel/utils/lmdb_data.py	89.87%	75 Missing ⚠️
deepmd/pt/utils/lmdb_dataset.py	64.74%	55 Missing ⚠️
deepmd/entrypoints/test.py	56.09%	18 Missing ⚠️
deepmd/pt/train/training.py	60.71%	11 Missing ⚠️
deepmd/pt/entrypoints/main.py	60.00%	10 Missing ⚠️
deepmd/utils/data_system.py	66.66%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #5283      +/-   ##
==========================================
+ Coverage   82.38%   82.39%   +0.01%     
==========================================
  Files         812      814       +2     
  Lines       83611    84574     +963     
  Branches     4091     4090       -1     
==========================================
+ Hits        68882    69687     +805     
- Misses      13508    13667     +159     
+ Partials     1221     1220       -1

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (5)

deepmd/dpmodel/utils/lmdb_data.py (5)

248-260: ⚠️ Potential issue | 🟡 Minor

Validate batch_size to prevent silent misconfiguration.

Unknown string specs (e.g., "auto_invalid") fall through to self._auto_rule = 32 at line 256, hiding config mistakes. Non-positive integers are also accepted at line 260.

Suggested fix

         self._auto_rule: int | None = None
         if isinstance(batch_size, str):
             if batch_size == "auto":
                 self._auto_rule = 32
             elif batch_size.startswith("auto:"):
-                self._auto_rule = int(batch_size.split(":")[1])
+                try:
+                    self._auto_rule = int(batch_size.split(":", 1)[1])
+                except ValueError as exc:
+                    raise ValueError(
+                        f"Invalid batch_size spec: {batch_size!r}"
+                    ) from exc
+                if self._auto_rule <= 0:
+                    raise ValueError("auto batch_size rule must be > 0")
             else:
-                self._auto_rule = 32
+                raise ValueError(
+                    f"batch_size must be 'auto', 'auto:N', or a positive int, got {batch_size!r}"
+                )
             self.batch_size = _compute_batch_size(self._natoms, self._auto_rule)
         else:
             self.batch_size = int(batch_size)
+            if self.batch_size <= 0:
+                raise ValueError("batch_size must be > 0")

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@deepmd/dpmodel/utils/lmdb_data.py` around lines 248 - 260, The batch_size
parsing in the constructor silently treats unknown string specs and non-positive
integers as valid (setting self._auto_rule = 32 or accepting non-positive ints),
so add explicit validation: in the batch_size handling for strings
(function/class using self._auto_rule and _compute_batch_size) only accept
"auto" or "auto:<positive_int>" and raise a ValueError for any other string
spec; when batch_size is numeric (the else branch that sets self.batch_size =
int(batch_size)) validate that the parsed int is > 0 and raise ValueError if
not; ensure error messages reference the provided batch_size value and keep use
of _compute_batch_size(self._natoms, self._auto_rule) unchanged when the auto
rule is valid.

974-976: ⚠️ Potential issue | 🟡 Minor

Guard against truncated frame_nlocs metadata during merge.

If src_nlocs has fewer entries than nframes (malformed metadata), src_nlocs[i] raises IndexError.

Suggested fix

-                if src_nlocs is not None:
+                if src_nlocs is not None and i < len(src_nlocs):
                     frame_nlocs.append(int(src_nlocs[i]))
                 else:

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@deepmd/dpmodel/utils/lmdb_data.py` around lines 974 - 976, The code indexing
src_nlocs[i] can raise IndexError when src_nlocs is shorter than nframes; in the
block that checks "if src_nlocs is not None:" (where
frame_nlocs.append(int(src_nlocs[i])) is called), first verify i <
len(src_nlocs) (or use a safe iterator) and if the entry is missing append a
sensible default (e.g., 0) or otherwise handle the malformed metadata (log/warn
and append 0) instead of indexing past the end; update the logic around
frame_nlocs, src_nlocs and nframes to perform this bounds check before
casting/indexing.

785-808: ⚠️ Potential issue | 🟡 Minor

Handle empty datasets in get_test().

If the LMDB has zero frames, self._nloc_groups is empty and max(self._nloc_groups, ...) at line 798 raises ValueError: max() arg is an empty sequence.

Suggested fix

         if nloc is not None:
             if nloc not in self._nloc_groups:
                 raise ValueError(
                     f"No frames with nloc={nloc}. Available: {sorted(self._nloc_groups.keys())}"
                 )
             frame_indices = self._nloc_groups[nloc]
             natoms = nloc
+        elif not self._frames:
+            raise ValueError("LMDB dataset contains no frames")
         elif len(self._nloc_groups) == 1:

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@deepmd/dpmodel/utils/lmdb_data.py` around lines 785 - 808, The method that
selects frames (inside get_test / the function using self._nloc_groups and
_frames) fails when the LMDB is empty because max(...) is called on an empty
self._nloc_groups; add an early check for empty dataset (e.g., if not
self._nloc_groups or not self._frames) and handle it by either raising a clear
ValueError ("No frames in LMDB") or returning an appropriate empty result before
reaching the mixed-nloc branch; update the branches that reference natoms and
frame_indices (the code that computes natoms = max(...) and frame_indices = ...)
so they only run when groups exist to avoid the max() on empty sequence.

436-447: ⚠️ Potential issue | 🟡 Minor

Remove unused unique_nlocs variable (Ruff F841).

unique_nlocs is assigned but never used. This will cause CI failure per coding guidelines.

Suggested fix

     def print_summary(self, name: str, prob: Any) -> None:
         """Print basic dataset info."""
-        unique_nlocs = sorted(self._nloc_groups.keys())
         nloc_info = ", ".join(
             f"{nloc}({len(idxs)})" for nloc, idxs in sorted(self._nloc_groups.items())
         )

As per coding guidelines, **/*.py: Always run ruff check . and ruff format . before committing changes or CI will fail.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@deepmd/dpmodel/utils/lmdb_data.py` around lines 436 - 447, In print_summary
(method print_summary) the local variable unique_nlocs is assigned from
self._nloc_groups but never used; remove the unused assignment (unique_nlocs =
sorted(self._nloc_groups.keys())) so the method only builds nloc_info from
self._nloc_groups and logs it, eliminating the Ruff F841 unused-variable error.

265-276: ⚠️ Potential issue | 🟡 Minor

Guard against out-of-range atom type IDs.

np.bincount(...)[: self._ntypes] truncates counts for type IDs >= self._ntypes, making natoms_vec inconsistent with frame contents.

Suggested fix

     def _compute_natoms_vec(self, atype: np.ndarray) -> np.ndarray:
         nloc = len(atype)
+        if atype.size and (np.min(atype) < 0 or np.max(atype) >= self._ntypes):
+            raise ValueError(
+                f"atype contains out-of-range IDs; expected [0, {self._ntypes - 1}]"
+            )
         counts = np.bincount(atype, minlength=self._ntypes)[: self._ntypes]

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@deepmd/dpmodel/utils/lmdb_data.py` around lines 265 - 276,
_compute_natoms_vec currently truncates counts for atom type IDs >=
self._ntypes; to fix, validate the atype array before counting: compute max_id =
atype.max() (handle empty atype properly) and if max_id >= self._ntypes or any
id < 0 raise a clear ValueError referencing _compute_natoms_vec and include the
offending max_id and self._ntypes in the message; otherwise use
np.bincount(atype, minlength=self._ntypes) to build counts and keep the existing
vec[0]=nloc, vec[1]=nloc, vec[2:]=counts logic.

🧹 Nitpick comments (2)

deepmd/pt/train/training.py (1)

263-265: Consider documenting the _reader access pattern.

_data._reader accesses a private attribute of LmdbDataset. While acceptable within the same package, consider either making this a public property or adding a brief comment explaining this coupling is intentional for distributed sampler construction.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepmd/pt/train/training.py` around lines 263 - 265, The code is accessing a
private attribute _data._reader of LmdbDataset when constructing the
DistributedSameNlocBatchSampler; update the codebase to avoid relying on a
private member by either exposing a public property/method on LmdbDataset (e.g.,
reader or get_reader) and using that, or add a concise inline comment next to
the DistributedSameNlocBatchSampler construction documenting that accessing
_reader is an intentional coupling for distributed sampler creation and should
not be refactored without updating the sampler; reference symbols:
_data._reader, DistributedSameNlocBatchSampler, LmdbDataset.

deepmd/dpmodel/utils/lmdb_data.py (1)

869-869: Rename unused loop variable to _req_info.

req_info is not used within the loop body (Ruff B007). Rename to _req_info to indicate it's intentionally unused.

-        for key, req_info in all_keys.items():
+        for key, _req_info in all_keys.items():

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@deepmd/dpmodel/utils/lmdb_data.py` at line 869, The loop over all_keys uses
an unused variable name req_info which triggers a lint warning; update the for
statement in lmdb_data.py (the loop "for key, req_info in all_keys.items():") to
rename req_info to _req_info so it becomes "for key, _req_info in
all_keys.items():" to indicate the variable is intentionally unused (no other
changes needed if req_info is not referenced elsewhere).

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@deepmd/pt/train/training.py`:
- Around line 554-555: The reported mismatch comes from using
LmdbDataset.total_batch (which does nframes // batch_size) to set
total_numb_batch while SameNlocBatchSampler.__len__ uses per-nloc ceiling
division; fix by computing total_numb_batch from the actual sampler instead of
total_batch: when using SameNlocBatchSampler (or when auto batch sizing / mixed
nloc are possible) set total_numb_batch = len(sampler) or recompute it by
summing ceil(group_size / batch_size_for_that_nloc) so num_steps (used later)
matches the sampler’s actual number of batches; update the block in training.py
that currently uses training_data.total_batch to use
SameNlocBatchSampler.__len__ (or equivalent ceiling-sum logic) so the two agree.

---

Duplicate comments:
In `@deepmd/dpmodel/utils/lmdb_data.py`:
- Around line 248-260: The batch_size parsing in the constructor silently treats
unknown string specs and non-positive integers as valid (setting self._auto_rule
= 32 or accepting non-positive ints), so add explicit validation: in the
batch_size handling for strings (function/class using self._auto_rule and
_compute_batch_size) only accept "auto" or "auto:<positive_int>" and raise a
ValueError for any other string spec; when batch_size is numeric (the else
branch that sets self.batch_size = int(batch_size)) validate that the parsed int
is > 0 and raise ValueError if not; ensure error messages reference the provided
batch_size value and keep use of _compute_batch_size(self._natoms,
self._auto_rule) unchanged when the auto rule is valid.
- Around line 974-976: The code indexing src_nlocs[i] can raise IndexError when
src_nlocs is shorter than nframes; in the block that checks "if src_nlocs is not
None:" (where frame_nlocs.append(int(src_nlocs[i])) is called), first verify i <
len(src_nlocs) (or use a safe iterator) and if the entry is missing append a
sensible default (e.g., 0) or otherwise handle the malformed metadata (log/warn
and append 0) instead of indexing past the end; update the logic around
frame_nlocs, src_nlocs and nframes to perform this bounds check before
casting/indexing.
- Around line 785-808: The method that selects frames (inside get_test / the
function using self._nloc_groups and _frames) fails when the LMDB is empty
because max(...) is called on an empty self._nloc_groups; add an early check for
empty dataset (e.g., if not self._nloc_groups or not self._frames) and handle it
by either raising a clear ValueError ("No frames in LMDB") or returning an
appropriate empty result before reaching the mixed-nloc branch; update the
branches that reference natoms and frame_indices (the code that computes natoms
= max(...) and frame_indices = ...) so they only run when groups exist to avoid
the max() on empty sequence.
- Around line 436-447: In print_summary (method print_summary) the local
variable unique_nlocs is assigned from self._nloc_groups but never used; remove
the unused assignment (unique_nlocs = sorted(self._nloc_groups.keys())) so the
method only builds nloc_info from self._nloc_groups and logs it, eliminating the
Ruff F841 unused-variable error.
- Around line 265-276: _compute_natoms_vec currently truncates counts for atom
type IDs >= self._ntypes; to fix, validate the atype array before counting:
compute max_id = atype.max() (handle empty atype properly) and if max_id >=
self._ntypes or any id < 0 raise a clear ValueError referencing
_compute_natoms_vec and include the offending max_id and self._ntypes in the
message; otherwise use np.bincount(atype, minlength=self._ntypes) to build
counts and keep the existing vec[0]=nloc, vec[1]=nloc, vec[2:]=counts logic.

---

Nitpick comments:
In `@deepmd/dpmodel/utils/lmdb_data.py`:
- Line 869: The loop over all_keys uses an unused variable name req_info which
triggers a lint warning; update the for statement in lmdb_data.py (the loop "for
key, req_info in all_keys.items():") to rename req_info to _req_info so it
becomes "for key, _req_info in all_keys.items():" to indicate the variable is
intentionally unused (no other changes needed if req_info is not referenced
elsewhere).

In `@deepmd/pt/train/training.py`:
- Around line 263-265: The code is accessing a private attribute _data._reader
of LmdbDataset when constructing the DistributedSameNlocBatchSampler; update the
codebase to avoid relying on a private member by either exposing a public
property/method on LmdbDataset (e.g., reader or get_reader) and using that, or
add a concise inline comment next to the DistributedSameNlocBatchSampler
construction documenting that accessing _reader is an intentional coupling for
distributed sampler creation and should not be refactored without updating the
sampler; reference symbols: _data._reader, DistributedSameNlocBatchSampler,
LmdbDataset.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: e42a3a8f-33f3-46a5-b818-47bcea016afe

📥 Commits

Reviewing files that changed from the base of the PR and between 5a10d4d and f91ba7d.

📒 Files selected for processing (3)

deepmd/dpmodel/utils/lmdb_data.py
deepmd/pt/train/training.py
pyproject.toml

coderabbitai

🧹 Nitpick comments (2)

deepmd/dpmodel/utils/lmdb_data.py (2)
892-896: Rename unused loop variable req_info to _req_info (Ruff B007).

The loop variable req_info is not used within the loop body. Rename it to _req_info to signal intentional non-use and satisfy Ruff.
🧹 Proposed fix
-        for key, req_info in all_keys.items():
+        for key, _req_info in all_keys.items():
             has_key = any(
                 key in f and isinstance(f.get(key), np.ndarray) for f in frames
             )
As per coding guidelines, **/*.py: Always run ruff check . and ruff format . before committing changes or CI will fail.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepmd/dpmodel/utils/lmdb_data.py` around lines 892 - 896, The loop in
lmdb_data.py iterates with an unused variable req_info (for key, req_info in
all_keys.items()) — rename it to _req_info to indicate intentional non-use and
satisfy Ruff B007; update the for-loop variable in that block (the loop that
sets result[f"find_{key}"]) and run ruff check/ruff format before committing.
962-966: Consider safer directory removal approach.

shutil.rmtree(dst_path) on line 966 unconditionally removes the existing path. While this enables idempotent merges, consider adding a confirmation parameter or logging a warning when overwriting to prevent accidental data loss.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepmd/dpmodel/utils/lmdb_data.py` around lines 962 - 966, The code
unconditionally deletes dst_path using shutil.rmtree which can cause accidental
data loss; modify the enclosing function (look for the routine that uses
dst_path and shutil.rmtree) to accept an explicit boolean flag like overwrite or
force_remove (default False), add logging (import logging) and call
logging.warning with the dst_path before deletion, and only call
shutil.rmtree(dst_path) when the flag is True (otherwise raise an error or
return); this makes removal explicit and auditable while keeping the idempotent
behavior when explicitly requested.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@deepmd/dpmodel/utils/lmdb_data.py`:
- Around line 892-896: The loop in lmdb_data.py iterates with an unused variable
req_info (for key, req_info in all_keys.items()) — rename it to _req_info to
indicate intentional non-use and satisfy Ruff B007; update the for-loop variable
in that block (the loop that sets result[f"find_{key}"]) and run ruff check/ruff
format before committing.
- Around line 962-966: The code unconditionally deletes dst_path using
shutil.rmtree which can cause accidental data loss; modify the enclosing
function (look for the routine that uses dst_path and shutil.rmtree) to accept
an explicit boolean flag like overwrite or force_remove (default False), add
logging (import logging) and call logging.warning with the dst_path before
deletion, and only call shutil.rmtree(dst_path) when the flag is True (otherwise
raise an error or return); this makes removal explicit and auditable while
keeping the idempotent behavior when explicitly requested.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 7be23553-3c34-461f-896f-f5f2b69bf477

📥 Commits

Reviewing files that changed from the base of the PR and between f91ba7d and d9a4db4.

📒 Files selected for processing (1)

deepmd/dpmodel/utils/lmdb_data.py

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (5)

deepmd/dpmodel/utils/lmdb_data.py (5)

367-371: ⚠️ Potential issue | 🟠 Major

Fail fast on invalid atype ids before remap/counting.

At Line 371 and Line 291, atype is used without bounds validation. Invalid ids can corrupt type remapping/counts or raise runtime errors in non-obvious paths. Validate range once before remap and before natoms_vec computation.

Suggested fix

         if "atype" in frame and isinstance(frame["atype"], np.ndarray):
             frame["atype"] = frame["atype"].reshape(-1).astype(np.int64)
+            if frame["atype"].size and np.min(frame["atype"]) < 0:
+                raise ValueError("atype contains negative type ids")
             # Remap atom types from LMDB's type_map to model's type_map
             if self._type_remap is not None:
+                if frame["atype"].size and np.max(frame["atype"]) >= len(self._type_remap):
+                    raise ValueError(
+                        f"atype id out of range for LMDB type_map (max allowed {len(self._type_remap)-1})"
+                    )
                 frame["atype"] = self._type_remap[frame["atype"]].astype(np.int64)
+            if frame["atype"].size and np.max(frame["atype"]) >= self._ntypes:
+                raise ValueError(
+                    f"atype contains out-of-range ids; expected [0, {self._ntypes - 1}]"
+                )

Also applies to: 290-292

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@deepmd/dpmodel/utils/lmdb_data.py` around lines 367 - 371, Validate that all
atom type IDs in frame["atype"] are within the valid range before any remapping
or any use in natoms_vec computation: check they are non-negative integers and
less than the length/shape of self._type_remap (when self._type_remap is not
None) or less than expected num_types; if any id is out of range raise a clear
ValueError including the offending ids and context (frame id), then proceed to
reshape/astype, apply self._type_remap on valid indices, and only after
successful remap compute natoms_vec; refer to symbols frame["atype"],
self._type_remap and natoms_vec to locate where to insert the validation and
error handling.

270-281: ⚠️ Potential issue | 🟠 Major

Validate batch_size strictly; don’t silently coerce invalid specs.

At Line 276, unknown string specs are silently treated as auto:32, and at Line 280 non-positive integers are accepted. This masks config mistakes and can break batching downstream.

Suggested fix

         self._auto_rule: int | None = None
         if isinstance(batch_size, str):
             if batch_size == "auto":
                 self._auto_rule = 32
             elif batch_size.startswith("auto:"):
-                self._auto_rule = int(batch_size.split(":")[1])
+                try:
+                    self._auto_rule = int(batch_size.split(":", 1)[1])
+                except ValueError as exc:
+                    raise ValueError(
+                        "batch_size must be 'auto', 'auto:N', or a positive integer"
+                    ) from exc
+                if self._auto_rule <= 0:
+                    raise ValueError("auto batch_size rule N must be > 0")
             else:
-                self._auto_rule = 32
+                raise ValueError(
+                    "batch_size must be 'auto', 'auto:N', or a positive integer"
+                )
             # Default batch_size uses first frame's nloc (for total_batch estimate)
             self.batch_size = _compute_batch_size(self._natoms, self._auto_rule)
         else:
             self.batch_size = int(batch_size)
+            if self.batch_size <= 0:
+                raise ValueError("batch_size must be > 0")

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@deepmd/dpmodel/utils/lmdb_data.py` around lines 270 - 281, The batch_size
handling currently silently coerces invalid string specs and accepts
non-positive integers; update the block that sets self._auto_rule and
self.batch_size (around batch_size, self._auto_rule, and _compute_batch_size in
lmdb_data.py) to strictly validate inputs: accept only "auto" or "auto:<positive
int>" for string specs (parse and set self._auto_rule or raise ValueError on any
other format), and when batch_size is numeric ensure int(batch_size) > 0 (raise
ValueError otherwise), also catch and re-raise parsing errors with a clear
message so invalid configs fail fast instead of being coerced.

461-461: ⚠️ Potential issue | 🟠 Major

Remove unused local unique_nlocs (F841).

Line 461 assigns unique_nlocs but never uses it.

Suggested fix

-        unique_nlocs = sorted(self._nloc_groups.keys())
         nloc_info = ", ".join(
             f"{nloc}({len(idxs)})" for nloc, idxs in sorted(self._nloc_groups.items())
         )

As per coding guidelines, **/*.py: Always run ruff check . and ruff format . before committing changes or CI will fail.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@deepmd/dpmodel/utils/lmdb_data.py` at line 461, The local variable
unique_nlocs is assigned via sorted(self._nloc_groups.keys()) but never used,
causing an F841; remove the unused assignment (delete the unique_nlocs line) or
replace its usage if it was intended to be used later in the same
function/method that contains self._nloc_groups; ensure you reference the
variable name unique_nlocs and the expression sorted(self._nloc_groups.keys())
when making the change, then run ruff check . and ruff format . before
committing.

848-855: ⚠️ Potential issue | 🟡 Minor

Handle empty datasets explicitly in get_test().

If the LMDB has zero frames, Line 854 calls max(...) on an empty dict and raises a cryptic error. Add an explicit empty-dataset guard.

Suggested fix

         if nloc is not None:
             if nloc not in self._nloc_groups:
                 raise ValueError(
                     f"No frames with nloc={nloc}. Available: {sorted(self._nloc_groups.keys())}"
                 )
             frame_indices = self._nloc_groups[nloc]
             natoms = nloc
+        elif not self._frames:
+            raise ValueError("LMDB dataset contains no frames")
         elif len(self._nloc_groups) == 1:
             # Uniform nloc — use all frames
             natoms = next(iter(self._nloc_groups))

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@deepmd/dpmodel/utils/lmdb_data.py` around lines 848 - 855, get_test() can
call max(...) on an empty self._nloc_groups when the LMDB has zero frames; add
an explicit guard at the start of get_test() (or before the branching that uses
self._nloc_groups) to detect an empty dataset (e.g., if not self._frames or not
self._nloc_groups) and handle it cleanly by returning an empty result or raising
a clear exception; update the branch that sets natoms/frame_indices (the block
using self._nloc_groups, natoms, and frame_indices) to assume non-empty only
after this guard so max(...) is never called on an empty dict.

1030-1032: ⚠️ Potential issue | 🟠 Major

Guard frame_nlocs metadata indexing in merge path.

Line 1031 assumes src_nlocs has nframes entries. Truncated/malformed metadata raises IndexError and aborts merge.

Suggested fix

-                if src_nlocs is not None:
-                    frame_nlocs.append(int(src_nlocs[i]))
-                else:
+                if src_nlocs is not None and i < len(src_nlocs):
+                    frame_nlocs.append(int(src_nlocs[i]))
+                else:
                     frame_raw = msgpack.unpackb(raw, raw=False)
                     atype_raw = frame_raw.get("atom_types")
                     if isinstance(atype_raw, dict):

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@deepmd/dpmodel/utils/lmdb_data.py` around lines 1030 - 1032, The code
unconditionally indexes src_nlocs[i] when building frame_nlocs in the merge
path, which can raise IndexError for truncated/malformed metadata; update the
merge logic (the block that appends to frame_nlocs using src_nlocs, referencing
variables frame_nlocs, src_nlocs, nframes, and loop index i) to guard access:
check that src_nlocs is not None and i < len(src_nlocs) before
converting/append; if missing, append a sensible default (e.g., 0) or log a
warning and continue so the merge does not abort.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@deepmd/dpmodel/utils/lmdb_data.py`:
- Line 925: The loop uses an unused variable `req_info` in the iteration `for
key, req_info in all_keys.items()` — rename `req_info` to `_req_info` or `_` to
satisfy Ruff B007 and avoid unused-variable warnings; update that loop
occurrence (and any identical loops in the same file) and run `ruff check .` /
`ruff format .` before committing.

---

Duplicate comments:
In `@deepmd/dpmodel/utils/lmdb_data.py`:
- Around line 367-371: Validate that all atom type IDs in frame["atype"] are
within the valid range before any remapping or any use in natoms_vec
computation: check they are non-negative integers and less than the length/shape
of self._type_remap (when self._type_remap is not None) or less than expected
num_types; if any id is out of range raise a clear ValueError including the
offending ids and context (frame id), then proceed to reshape/astype, apply
self._type_remap on valid indices, and only after successful remap compute
natoms_vec; refer to symbols frame["atype"], self._type_remap and natoms_vec to
locate where to insert the validation and error handling.
- Around line 270-281: The batch_size handling currently silently coerces
invalid string specs and accepts non-positive integers; update the block that
sets self._auto_rule and self.batch_size (around batch_size, self._auto_rule,
and _compute_batch_size in lmdb_data.py) to strictly validate inputs: accept
only "auto" or "auto:<positive int>" for string specs (parse and set
self._auto_rule or raise ValueError on any other format), and when batch_size is
numeric ensure int(batch_size) > 0 (raise ValueError otherwise), also catch and
re-raise parsing errors with a clear message so invalid configs fail fast
instead of being coerced.
- Line 461: The local variable unique_nlocs is assigned via
sorted(self._nloc_groups.keys()) but never used, causing an F841; remove the
unused assignment (delete the unique_nlocs line) or replace its usage if it was
intended to be used later in the same function/method that contains
self._nloc_groups; ensure you reference the variable name unique_nlocs and the
expression sorted(self._nloc_groups.keys()) when making the change, then run
ruff check . and ruff format . before committing.
- Around line 848-855: get_test() can call max(...) on an empty
self._nloc_groups when the LMDB has zero frames; add an explicit guard at the
start of get_test() (or before the branching that uses self._nloc_groups) to
detect an empty dataset (e.g., if not self._frames or not self._nloc_groups) and
handle it cleanly by returning an empty result or raising a clear exception;
update the branch that sets natoms/frame_indices (the block using
self._nloc_groups, natoms, and frame_indices) to assume non-empty only after
this guard so max(...) is never called on an empty dict.
- Around line 1030-1032: The code unconditionally indexes src_nlocs[i] when
building frame_nlocs in the merge path, which can raise IndexError for
truncated/malformed metadata; update the merge logic (the block that appends to
frame_nlocs using src_nlocs, referencing variables frame_nlocs, src_nlocs,
nframes, and loop index i) to guard access: check that src_nlocs is not None and
i < len(src_nlocs) before converting/append; if missing, append a sensible
default (e.g., 0) or log a warning and continue so the merge does not abort.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: ed63ac4d-0db4-446a-aaf9-6d84603c1c04

📥 Commits

Reviewing files that changed from the base of the PR and between d9a4db4 and 1a1f41d.

📒 Files selected for processing (1)

deepmd/dpmodel/utils/lmdb_data.py

coderabbitai

🧹 Nitpick comments (3)

deepmd/dpmodel/utils/lmdb_data.py (3)
762-765: Consider deterministic seeding for test reproducibility.

The shuffle uses np.random.default_rng() without a seed, making test results non-reproducible. While randomization is often desirable for testing, consider adding an optional seed parameter for debugging reproducibility.
♻️ Proposed enhancement
     def __init__(
         self,
         lmdb_path: str,
         type_map: list[str] | None = None,
         shuffle_test: bool = True,
+        seed: int | None = None,
         **kwargs: Any,
     ) -> None:
         ...
         # Shuffle if requested
         if shuffle_test:
-            rng = np.random.default_rng()
+            rng = np.random.default_rng(seed)
             indices = rng.permutation(len(self._frames))
             self._frames = [self._frames[i] for i in indices]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepmd/dpmodel/utils/lmdb_data.py` around lines 762 - 765, The test shuffle
is non-deterministic because it calls np.random.default_rng() without a seed;
update the method (the code that checks shuffle_test and shuffles self._frames)
to accept an optional seed parameter (default None) and pass it into
np.random.default_rng(seed) so permutation(self._frames) becomes reproducible
when a seed is provided; ensure callers that invoke this behavior can pass the
seed (or leave None for non-deterministic behavior) and keep using indices =
rng.permutation(len(self._frames)) and self._frames = [self._frames[i] for i in
indices] unchanged.
1007-1045: Consider using context manager for source LMDB environments.

While src_env.close() is called, an exception during the loop could leave environments open. Using a context manager would ensure cleanup.
♻️ Proposed enhancement
     for src_path in src_paths:
-        src_env = _open_lmdb(src_path)
-        with src_env.begin() as txn:
-            meta = _read_metadata(txn)
-        nframes, src_fmt, natoms_per_type = _parse_metadata(meta)
-        ...
-        src_env.close()
+        with lmdb.open(src_path, readonly=True, lock=False, readahead=False, meminit=False) as src_env:
+            with src_env.begin() as txn:
+                meta = _read_metadata(txn)
+            nframes, src_fmt, natoms_per_type = _parse_metadata(meta)
+            ...
Note: Verify that lmdb.Environment supports context manager protocol (it does since lmdb 0.94).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepmd/dpmodel/utils/lmdb_data.py` around lines 1007 - 1045, Wrap source LMDB
environments opened by _open_lmdb in a context manager to guarantee they are
closed on exceptions: replace the pattern "src_env = _open_lmdb(src_path); with
src_env.begin() as txn: ...; src_env.close()" with "with _open_lmdb(src_path) as
src_env: ..." and keep the existing nested "with src_env.begin() as src_txn,
dst_env.begin(write=True) as dst_txn:" block intact; remove the explicit
src_env.close() call and ensure code still references src_nlocs, src_txn,
frame_idx, and other variables unchanged.
245-247: Thread-safety assumption for persistent transaction.

The comment at Line 246 states the persistent transaction is "Safe because we use num_workers=0 in DataLoader." This is a critical assumption—if a caller uses num_workers > 0 with PyTorch DataLoader, the LMDB transaction will be shared across processes (via fork), which can cause corruption or hangs.

Consider adding a runtime check or explicit documentation in the class docstring warning against multi-worker usage, or lazily initializing the transaction per-worker.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@deepmd/dpmodel/utils/lmdb_data.py`:
- Around line 762-765: The test shuffle is non-deterministic because it calls
np.random.default_rng() without a seed; update the method (the code that checks
shuffle_test and shuffles self._frames) to accept an optional seed parameter
(default None) and pass it into np.random.default_rng(seed) so
permutation(self._frames) becomes reproducible when a seed is provided; ensure
callers that invoke this behavior can pass the seed (or leave None for
non-deterministic behavior) and keep using indices =
rng.permutation(len(self._frames)) and self._frames = [self._frames[i] for i in
indices] unchanged.
- Around line 1007-1045: Wrap source LMDB environments opened by _open_lmdb in a
context manager to guarantee they are closed on exceptions: replace the pattern
"src_env = _open_lmdb(src_path); with src_env.begin() as txn: ...;
src_env.close()" with "with _open_lmdb(src_path) as src_env: ..." and keep the
existing nested "with src_env.begin() as src_txn, dst_env.begin(write=True) as
dst_txn:" block intact; remove the explicit src_env.close() call and ensure code
still references src_nlocs, src_txn, frame_idx, and other variables unchanged.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 96e35842-9118-4449-b995-2d43315d81b6

📥 Commits

Reviewing files that changed from the base of the PR and between 1a1f41d and b14c6cc.

📒 Files selected for processing (1)

deepmd/dpmodel/utils/lmdb_data.py

+        del _ENV_CACHE[resolved]
+        try:
+            env.close()
+        except Exception:


njzjz

I feel that there are too many if...else blocks added in this PR and they don't have a good abstract.

+    from deepmd.utils.data_system import (
+        prob_sys_size_ext,
+    )


+    from deepmd.dpmodel.utils.lmdb_data import (
+        is_lmdb,
+    )


iProzd added 2 commits February 28, 2026 02:16

feat(pt dpmodel): add lmdb dataloader

8f554d9

update distributed sampler

5a10d4d

Copilot AI review requested due to automatic review settings March 3, 2026 07:56

github-actions bot added Python Examples labels Mar 3, 2026

iProzd marked this pull request as draft March 3, 2026 07:56

dosubot bot added the new feature label Mar 3, 2026

github-advanced-security AI found potential problems Mar 3, 2026

View reviewed changes

Copilot started reviewing on behalf of iProzd March 3, 2026 07:59 View session

coderabbitai bot reviewed Mar 3, 2026

View reviewed changes

Copilot AI reviewed Mar 3, 2026

View reviewed changes

iProzd added 3 commits March 3, 2026 18:41

Update pyproject.toml

599d221

Merge branch 'master' into 0225-lmdb-dataloader

d163b26

resolve dtype

f91ba7d

iProzd requested review from njzjz and wanghan-iapcm March 9, 2026 07:22

iProzd marked this pull request as ready for review March 9, 2026 07:22

coderabbitai bot reviewed Mar 9, 2026

View reviewed changes

Comment thread deepmd/pt/train/training.py

wanghan-iapcm requested changes Mar 11, 2026

View reviewed changes

Comment thread examples/lmdb_downsample_data/input_lmdb.json

Comment thread source/tests/consistent/test_lmdb_data.py Outdated

fix type map

d9a4db4

coderabbitai bot reviewed Mar 16, 2026

View reviewed changes

iProzd added 2 commits March 16, 2026 20:09

fix type map for test

1a1f41d

Update lmdb_data.py

b14c6cc

coderabbitai bot reviewed Mar 16, 2026

View reviewed changes

Comment thread deepmd/dpmodel/utils/lmdb_data.py

coderabbitai bot reviewed Mar 16, 2026

View reviewed changes

njzjz linked an issue Mar 19, 2026 that may be closed by this pull request

Support New Data Format: deepmd/lmdb #5233

Closed

add autoprob

81ee6c6

github-advanced-security AI found potential problems Mar 19, 2026

View reviewed changes

Comment thread deepmd/dpmodel/utils/lmdb_data.py Fixed

Comment thread deepmd/dpmodel/utils/lmdb_data.py Fixed

Comment thread source/tests/consistent/test_lmdb_data.py Fixed

iProzd added 10 commits March 20, 2026 00:12

Merge branch 'master' into 0225-lmdb-dataloader

a465fab

fix efficiency

28b9300

fix multitask

792361e

fix neighbor-stat

6b275bf

fix repeat keys

24b9060

add ut for extra keys

78e2d5f

add example data

2a0f1aa

Create README.md

4777dd9

Update lmdb_data.py

1f1e18e

fix ut

c77bc9e

github-advanced-security AI found potential problems Mar 30, 2026

View reviewed changes

Comment thread deepmd/dpmodel/utils/lmdb_data.py

del _ENV_CACHE[resolved]

try:

env.close()

except Exception:

Check notice

Code scanning / CodeQL

Empty except Note

'except' clause does nothing but pass and there is no explanatory comment.

iProzd added 3 commits March 31, 2026 15:59

Update test_lmdb_data.py

7db1111

Update test_lmdb_dataloader.py

22fde3d

Update test_lmdb_data.py

3e2e77b

iProzd requested a review from wanghan-iapcm April 1, 2026 09:22

wanghan-iapcm approved these changes Apr 2, 2026

View reviewed changes

Merge branch 'master' into 0225-lmdb-dataloader

637cf68

njzjz requested changes Apr 6, 2026

View reviewed changes

Comment thread deepmd/dpmodel/utils/lmdb_data.py Outdated

Comment thread deepmd/dpmodel/utils/lmdb_data.py Outdated

Comment thread deepmd/entrypoints/test.py Outdated

Comment thread deepmd/entrypoints/test.py

Comment thread deepmd/pt/entrypoints/main.py Outdated

resolve comments

fb73297

iProzd requested a review from njzjz April 6, 2026 14:37

github-advanced-security AI found potential problems Apr 6, 2026

View reviewed changes

njzjz approved these changes Apr 6, 2026

View reviewed changes

fix ut

72736f7

iProzd enabled auto-merge April 7, 2026 08:32

iProzd added this pull request to the merge queue Apr 7, 2026

Merged via the queue into deepmodeling:master with commit 565f4be Apr 7, 2026
70 checks passed

iProzd deleted the 0225-lmdb-dataloader branch April 7, 2026 19:21

Conversation

iProzd commented Mar 3, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

Uh oh!

Uh oh!

Check notice

Uh oh!

coderabbitai bot commented Mar 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagrams

Estimated code review effort

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

codecov bot commented Mar 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Check notice

njzjz left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Check notice

Uh oh!

Check notice

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

iProzd commented Mar 3, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Mar 3, 2026 •

edited

Loading

codecov bot commented Mar 3, 2026 •

edited

Loading